Abstract

The University of Sheffield (USFD) participated in the International Workshop on Spoken Language Translation (IWSLT) in 2014. In this paper, we introduce the USFD SLT system for IWSLT 2014. Automatic speech recognition (ASR) is achieved by two multi-pass deep neural network systems with adaptation and rescoring techniques. Machine translation (MT) is achieved by a phrase-based system. The USFD primary system incorporates state-of-the-art ASR and MT techniques and gives BLEU scores of 23.45 and 14.75 on the English-to-French and English-to-German speech-to-text translation tasks with the IWSLT 2014 data. The USFD contrastive systems explore the integration of ASR and MT by using a quality estimation system to rescore the ASR outputs, optimising towards better translation. This gives a further 0.54 and 0.26 BLEU improvement on the IWSLT 2012 and 2014 evaluation data respectively.

1. Introduction 

In this paper, the University of Sheffield (USFD) system for the International Workshop on Spoken Language Translation (IWSLT) 2014 is introduced. USFD participated in the English-to-French and English-to-German SLT tasks. The ASR and MT systems made use of state-of-the-art technologies. On the ASR side, two deep neural network systems built on partially different data and different tandem configurations were used. On the MT side, phrase-based translation models were built. ASR and MT system integration was attempted by using a translation quality estimation system, which considered the system scores from both ASR and MT as well as features extracted from the ASR outputs in the source language. The ASR hypotheses were then rescored based on the predicted translation quality. This gives performance improvements in terms of BLEU score.

In the following, the data used for system training is introduced in Section 2. Sections 3 and 4 give the details of the ASR and MT systems. The decoding algorithm and system results are given in Section 5. Besides the primary submission, USFD also submitted contrastive systems which implement system integration. These systems used a quality estimation module and performed ASR N-best list rescoring based on predicted translation quality. They are described in Section 6.


2. Data processing and selection 

The ASR and MT systems were primarily trained on TED lecture data [?]. For ASR, TED and the additional data form two data subsets, on which two systems were trained. For MT, out-of-domain data were incorporated, after data selection, in the training of translation models and target language models.

2.1. ASR acoustic modelling 

Two data sets were used for ASR system training. For ease of discussion they are hereinafter referred to as ASR1 and ASR2. The composition of the two data sets is shown in Table 1.


Table 1: Data for acoustic model training

ASR1               ASR2
Data    Hours      Data              Hours
TED     132        TED               112
LLC     106        AMI+AMIDA+ICSI    165
ECRN    60         ECRN              60


TED serves as a common data set in both ASR1 and ASR2. Its segmentation differs slightly between ASR1 and ASR2, as explained later. The two data sets are augmented by 60 hours of e-corner lecture data (ECRN) [?]. ASR1 also contains 106 hours of LLC lecture data. In ASR2, 165 hours of meeting data from the AMI, AMIDA and ICSI corpora are added so that the trained model also reflects generic domains other than lectures [?].

The TED portions in both ASR1 and ASR2 originate from 734 TED talks published before 31 Dec 2010. Each talk has a duration of around 15 minutes. Human annotations in the form of subtitles are also available, giving a rough segmentation with segment durations from 3 to 5 seconds and time accuracy to the nearest second.

Exact segmentations and transcriptions of TED were derived in different ways in ASR1 and ASR2. In ASR1, all segments from the same talk were merged, and the speech was force-aligned and resegmented before another forced alignment run determined the final training set. This gave a total of 132 hours of speech for AM training. In ASR2, forced alignment was performed on the rough segmentation, after which contiguous segments were merged when there was tight silence at the segment boundaries. A further run of forced alignment determined the final training set. This gave a total of 112 hours of speech.

Table 2: Amount of text data used in different training tasks in En→Fr translation (# Full data set was used for building the target LM)

                   Number of words (million)
Data               Target LM#   Source LM   PunctTM   TM
TED                3.17         3.17        3.17      3.17
News Commentary    4.0          0.9         0.2       0.7
Common Crawl       70.7         36.1        3.6       10.8
Gigaword           575.7        271.2       26.3      14.9
Europarl           50.3         10.8        4.3       1.9

To evaluate the performance of the different segmentations, PLP-based state-tied triphone models with cepstral mean and variance normalisation were trained on these data and decoding was performed on the IWSLT 2010 evaluation data set. The WERs for the ASR1 and ASR2 settings are 25.7% and 26.2% respectively. When the models are trained directly on the roughly segmented data (no adjustment of segmentations), the total duration of training data is 109 hours and the corresponding WER is 28.1%.

2.2. Language models and MT 

Textual data for the training of language models and translation models were obtained from the affiliated websites of the IWSLT and WMT evaluations [?]. TED was considered as the in-domain training data and the full data set was used. Four out-of-domain (OOD) data sets, News Commentary v9, Common Crawl, Gigaword and Europarl v7, were also used after a data selection process.

The OOD corpora were selected with the cross entropy difference criterion [?]. Given a sentence $x_1^I = [x_1 \cdots x_I]$ with $I$ words, cross entropy values $H(x_1^I, \mathrm{ID})$ and $H(x_1^I, \mathrm{OOD})$ were computed using $Q_{\mathrm{ID}}$, the ID language model (in this case, TED), and $Q_{\mathrm{OOD}}$, the OOD language model (built on the corpus from which the sentence was taken). The cross entropy difference (CED) was given by

$$\mathrm{CED}(x_1^I) = H(x_1^I, \mathrm{ID}) - H(x_1^I, \mathrm{OOD}) \qquad (1)$$

Sentences were ranked by their CED values and the 25% of sentences with the lowest CED values were selected from each corpus. Furthermore, CED values were calculated on sentence batches of increasing size, and a line search was done to find the optimal batch giving the minimum CED value. All data selection was done on the English text. For translation model training, the corresponding sentences in the target languages were extracted after selection was done on the English sentences.
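The ranking step of the selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: add-one smoothed unigram models stand in for the real ID and OOD language models, the toy sentences and all function names are hypothetical, and the batch-wise line search is not shown.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Add-one smoothed unigram LM (a stand-in for the real n-gram LMs)."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserves mass for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)


def cross_entropy(sentence, lm):
    """Per-word cross entropy (in bits) of a sentence under a language model."""
    words = sentence.split()
    return -sum(math.log2(lm(w)) for w in words) / len(words)


def select_by_ced(ood_sentences, id_lm, ood_lm, fraction=0.25):
    """Keep the given fraction of OOD sentences with the lowest CED."""
    ranked = sorted(
        ood_sentences,
        key=lambda s: cross_entropy(s, id_lm) - cross_entropy(s, ood_lm),
    )
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical toy corpora.
in_domain = ["we share ideas worth spreading", "ideas change the world"]
ood = [
    "ideas change the world slowly",
    "the council adopted the regulation",
    "stock markets fell sharply today",
    "the committee adopted the directive",
]
selected = select_by_ced(ood, train_unigram(in_domain), train_unigram(ood))
```

With these toy corpora, the sentence that shares most of its words with the in-domain text has the lowest CED and is the one retained at the 25% rate.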

Table 2 shows the amount of the full text data set and of the selected text data in different systems in the English→French translation task. The full data set contains 703.9M words. It was used for training the target language model in MT, which was a 5-gram interpolated LM in standard ARPA format with punctuation and out-of-vocabulary word modelling and modified Kneser-Ney smoothing. The source language model for ASR was built on the full TED data set and 25% or 50% of the OOD data, making up 322.2M words. A monolingual translation model was trained for punctuation insertion and case conversion. The training took the full TED data and 5-10% of the OOD data, resulting in a total of 37.6M words. The translation model was trained on the full TED data set and other optimally selected OOD data sets, where only around 5% of the sentences were selected. The total number of words is 31.7M.

3. Automatic speech recognition 

There are two DNN systems with tandem configurations in ASR [?]. Bottleneck (BN) features were derived from deep neural networks (DNNs) [?], and GMM-HMM systems were trained on these bottleneck features. The two tandem systems were trained on ASR1 and ASR2 data respectively (Table 1). Different portions of data were used in different stages of training. Let DNN1 and DNN2 denote the two DNN systems for ASR1 and ASR2. DNN1 was trained on TED data only. DNN2 was trained on TED and AMI+AMIDA+ICSI data only. The remaining data listed in Table 1 were added to the training pool in the GMM-HMM training stage.

DNN1 has 4 hidden layers, each having 1,745 hidden units. The BN layer is placed just before the output layer and has 26 units. The output layer has 4,320 units. DNN2 has 5 hidden layers, with the first 3 layers having 1,745 units and the fourth hidden layer having 65 units. A BN layer is placed just before the output layer and has 39 units. The output layer has 5,691 units.

Both DNNs were trained using log filter-bank outputs, concatenating 31 adjacent frames which were decorrelated using DCT to form a 368-dimensional feature vector. The filter-bank outputs were mean and variance normalised at the speaker level. Global mean and variance normalisation was performed on each dimension before feeding the input to DNN training. The GMM-HMM systems trained using the BN features were different. The model for ASR1 was trained on the concatenation of the 26-dimensional BN features from DNN1 and 39-dimensional PLP features. The model for ASR2 was trained on the 39-dimensional BN features from DNN2. Both GMM-HMM models were trained as tied-state triphone systems, with the final models having 16 Gaussian mixture components per state.
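The DNN input construction described above (31 stacked frames decorrelated with a DCT) can be sketched as follows. This is an illustration only: the text gives the 31-frame context and the 368-dimensional result but not the factorisation, so the 23 filter-bank channels and 16 DCT coefficients per channel (23 x 16 = 368) are assumptions, as are the function names and the clamping of the window at utterance edges. No windowing or normalisation is applied here.

```python
import math


def dct_basis(num_frames, num_coeffs):
    """DCT-II basis over the time axis: num_coeffs rows of num_frames values."""
    return [
        [math.cos(math.pi * k * (2 * t + 1) / (2 * num_frames))
         for t in range(num_frames)]
        for k in range(num_coeffs)
    ]


def stacked_dct_features(frames, centre, context=15, num_coeffs=16):
    """Build one DNN input vector for the frame at index `centre`.

    frames: list of per-frame log filter-bank vectors of equal length.
    The window spans centre-context .. centre+context (31 frames for
    context=15); indices outside the utterance are clamped to its edges.
    """
    basis = dct_basis(2 * context + 1, num_coeffs)
    last = len(frames) - 1
    window = [frames[min(max(centre + d, 0), last)]
              for d in range(-context, context + 1)]
    feature = []
    for band in range(len(frames[0])):
        trajectory = [frame[band] for frame in window]  # one channel over time
        for row in basis:
            feature.append(sum(b * x for b, x in zip(row, trajectory)))
    return feature


# Hypothetical utterance: 100 frames of 23 filter-bank channels (assumed size).
utterance = [[math.sin(0.1 * t + band) for band in range(23)] for t in range(100)]
vector = stacked_dct_features(utterance, centre=50)
```

Under these assumptions each channel contributes 16 coefficients, so `vector` has 368 entries, matching the dimensionality stated in the text.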

All systems are vocal tract length normalised (VTLN). In the training stage, a PLP system was used to obtain the warp factors for each speaker. The filter-bank and PLP features were then VTLN-warped and used in turn for DNN and GMM-HMM training in the tandem configuration. In the decoding stage, a non-VTLN DNN and GMM-HMM tandem system trained on ASR2 data replaced the PLP system for the derivation of warp factors.

[Figure: Speech → Non-VTLN system → two parallel streams (ASR1 and ASR2), each with VTLN system → Cross MLLR → MLLR → AM & LM rescoring with lattice → ROVER → Decoded text (source language)]

Figure 1: System diagram for multi-pass ASR decoding.

To improve the performance of the acoustic model, minimum phone error (MPE) training was performed using lattices generated with a unigram language model [9].

Language models for ASR are all interpolated LMs built on the English text data described in Table 2 and tuned on IWSLT 2010 dev and eval data. 2-gram and 4-gram ARPA language models were trained for lattice generation and expansion. The 4-gram LM was pruned with a threshold of $10^{-10}$ and a weighted finite-state transducer (WFST) was constructed for fast decoding in the pre-final passes of the ASR systems.

All ASR LMs were based on a word list with a 60k-word vocabulary extracted from our standard English ASR inventory and the English part of the TED MT training data for IWSLT 2014 [?]. Pilot ASR experiments on the IWSLT 2011 and 2012 eval data showed a drop in perplexity with the addition of Common Crawl and Gigaword data. For these two corpora, the proportion of data selected for LM building was therefore set to 50%, while the proportion for the other OOD corpora was kept at 25%. This made the total number of words 322.2M, as shown in Table 2.

Pronunciation probabilities were incorporated in final-stage decoding [10]. These probabilities were extracted based on the Viterbi alignment of the phoneme-level transcription of the ASR1 training data. When a word allowed multiple pronunciations, the frequency of each pronunciation was calculated and stored. These frequencies were then applied to the words in the decoding dictionary for words that appeared in both the training and decoding stages. Words with multiple pronunciations appearing only in the decoding stage were given equal probability.
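The frequency-based estimate described above can be sketched as follows. This is a minimal illustration, not the system's code: the function name, the data layout and the phoneme strings are hypothetical, and a variant of a seen word that never occurs in the alignment simply receives probability zero here, a case the text does not specify.

```python
from collections import Counter, defaultdict


def pronunciation_probs(aligned_tokens, decoding_dict):
    """Estimate pronunciation probabilities from aligned training data.

    aligned_tokens: (word, pronunciation) pairs taken from a Viterbi alignment.
    decoding_dict: word -> list of pronunciation variants used in decoding.
    Returns word -> {pronunciation: probability}.
    """
    counts = defaultdict(Counter)
    for word, pron in aligned_tokens:
        counts[word][pron] += 1

    probs = {}
    for word, variants in decoding_dict.items():
        seen = counts.get(word)
        total = sum(seen[v] for v in variants) if seen else 0
        if total > 0:
            # Relative frequency of each variant in the training alignment.
            probs[word] = {v: seen[v] / total for v in variants}
        else:
            # Word appears only at decoding time: equal probability.
            probs[word] = {v: 1.0 / len(variants) for v in variants}
    return probs


# Hypothetical alignment output and decoding dictionary.
alignment = [
    ("either", "iy dh er"), ("either", "iy dh er"), ("either", "ay dh er"),
    ("the", "dh ax"),
]
dictionary = {
    "either": ["iy dh er", "ay dh er"],
    "the": ["dh ax", "dh iy"],
    "tomato": ["t ax m ey t ow", "t ax m aa t ow"],
}
probs = pronunciation_probs(alignment, dictionary)
```

In this toy case "either" receives probabilities 2/3 and 1/3, while "tomato", which is absent from the alignment, receives 0.5 for each variant.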


4. Machine translation 

A phrase-based model using Moses [?] in a standard setting was employed. For phrase extraction all of the TED data (3.17 million words) was used. Following previous findings [?], data selection via a cross-entropy difference criterion (detailed in Section 2.2) was used to select the optimal batch of the OOD data, which amounts to about 5% of the total data or 30.58M words. The phrase length was limited to 5 and word alignment was obtained with FastAlign [?]. Lexicalised reordering models were trained using the same data. For language modelling, we used the complete sets of OOD data (i.e. no data selection). 5-gram LMs were trained using LMPLZ [?]. 100-best MIRA tuning was employed [?]. For the English-to-French system, tuning was done on the IWSLT 2010 development and evaluation data with a total of 2,551 sentences. For the English-to-German system, tuning was done on the IWSLT 2010 development data with 887 sentences.

In SLT, the input to the MT system was ASR output, which typically lacks casing and punctuation. Following previous work [?], a monolingual translation system was trained to recover casing and punctuation from the ASR output, thus producing source sentences which are more adequate for translation. The training data for this monolingual MT system was obtained by pre-processing an actual corpus of the source language to form pseudo ASR outputs, which contained no case and punctuation information. Numbers, symbols and acronyms were also converted to their verbal forms with lookup tables. We then used this synthesised corpus of pseudo ASR as the source, and the original corpus as the target, of our monolingual MT. The monolingual translation system was trained on 37.6M words (Table 2). It performed monotonic translation with phrases of up to 7 words.
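The construction of a pseudo ASR training pair can be sketched as follows. This is a simplified illustration under stated assumptions: the lookup table is a tiny hypothetical stand-in for the real tables, which are not given in the text, acronym expansion is omitted, and the tokenisation handles ASCII text only.

```python
import re

# Hypothetical lookup table; the tables actually used are not given.
VERBAL_FORMS = {"%": "percent", "&": "and", "3": "three", "20": "twenty"}


def to_pseudo_asr(sentence, lookup=VERBAL_FORMS):
    """Strip case and punctuation and verbalise numbers and symbols."""
    # Keep words, digit strings and the symbols we can verbalise;
    # all other punctuation is dropped by not being matched.
    tokens = re.findall(r"[A-Za-z']+|\d+|[%&]", sentence)
    return " ".join(lookup.get(tok, tok).lower() for tok in tokens)


target = "In 3 years, sales grew 20%."
source = to_pseudo_asr(target)
# (source, target) forms one training pair for the monolingual system.
```

Here `source` is the string "in three years sales grew twenty percent", and the original sentence is kept unchanged as the target side.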


5. Decoding 

The evaluation systems for ASR and MT are multi-pass systems with resource optimisation and environment management capabilities [?]. The ASR is a two-stream multi-pass system, illustrated in Figure 1. The two streams ASR1 and ASR2 differ in the acoustic model training data (detailed in Table 1) and in the tandem configurations (detailed in Section 3). Both streams follow the same routine along the multi-pass decoding system. In pass 1, a unified decoding result was generated using a non-VTLN DNN and GMM-HMM tandem system with cepstral mean and variance normalisation (CMVN) trained on ASR2 data. These hypothesis transcripts were used for inferring the warp factors. The filter-bank (for both ASR1 and ASR2) and PLP (for ASR1 only) features were then warped and CMVN normalised, and the system branched off into two streams with two VTLN decoders trained on ASR1 and ASR2 data respectively.

Table 3: Tree-search and WFST decoder

               Tst11           Tst12
Decoder        WER     RT      WER     RT
Tree-search    23.7%   18.4    27.0%   19.8
WFST           23.7%   3.0     27.0%   3.3

After pass 2 decoding, speaker-based MLLR cross adaptation was carried out: the transcripts from ASR1 were used for the model transformation in the ASR2 system and vice versa. The number of regression classes was set to 16. After pass 3 decoding, MLLR self adaptation was performed, again with 16 regression classes.

All pre-final stage decoding made use of weighted finite state transducers (WFSTs) for fast implementation. In a pilot experiment, PLP systems with heteroscedastic linear discriminant analysis (HLDA) were trained on the ASR2 data [?]. WFST decoding with a pruned 4-gram grammar network was compared with the standard tree search with an unpruned 3-gram LM. The WER and real-time factor (RT) on the IWSLT 2011 and IWSLT 2012 evaluation data are shown in Table 3. WFST decoding achieved the same WER as tree-search decoding while running about six times faster (RT of 3.0-3.3 against 18.4-19.8).

In the final stage, acoustic and language model rescoring was performed. Base lattices were generated with a 2-gram LM pruned with a threshold of $10^{-10}$. Lattice expansion was done with unpruned 4-gram language models. Three settings were tried and the results were compared:

(i) Language model rescoring with the 4-gram LM

(ii) Considering pronunciation probability (Pron. prob.) on top of (i)

(iii) Acoustic and language model rescoring with the setting of (ii)

ASR performance in terms of WER is shown in Table 4.

The initial non-VTLN system gave WERs of 16.9% and 17.7% on the IWSLT 2011 and 2012 data respectively. Moving to the VTLN systems, where ASR1 and ASR2 branched off, the ASR1 model gave 1.0% to 1.4% lower WER than the ASR2 model. This is because the data in ASR1 were a better match in terms of domain. Incremental performance gains can be observed in individual steps, particularly MPE, cross-adaptation and language model rescoring. The WER difference between ASR1 and ASR2 diminished to 0.4-0.5% after all optimisation steps. After system combination, the final WER is 21-25% relatively lower than that of the initial system.
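The relative reduction quoted above follows from the non-VTLN and ROVER rows of Table 4. The short calculation below reproduces it; the function name is only illustrative.

```python
def relative_reduction(initial_wer, final_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (initial_wer - final_wer) / initial_wer


# Non-VTLN pass versus ROVER combination (WER in %).
tst11 = relative_reduction(16.9, 13.3)
tst12 = relative_reduction(17.7, 13.2)
print(f"Tst11: {tst11:.1f}%  Tst12: {tst12:.1f}%")
```

This gives about 21.3% on Tst11 and 25.4% on Tst12, which is the 21-25% range stated in the text.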

MT decoding was performed with cube pruning [20] both in tuning and testing. Decoding was done with the minimum Bayes risk criterion, and reordering over punctuation was forbidden. To restore the correct case of the output, the truecasing heuristic was employed. The same set of standard techniques was applied to En→Fr and En→De translation.

Table 4: WER of the multi-pass ASR systems

                   Tst11              Tst12
ASR system         ASR1     ASR2      ASR1     ASR2
Non-VTLN           -        16.9%     -        17.7%
+VTLN              15.4%    16.4%     16.4%    16.8%
+MPE               14.7%    15.7%     16.0%    16.1%
+Cross-adapt       14.0%    14.9%     14.2%    14.8%
+Self-adapt        14.0%    15.0%     14.2%    14.7%
+LM rescoring      13.4%    14.5%     13.5%    14.2%
+Pron. prob.       13.3%    14.2%     13.4%    14.0%
+AM rescoring      13.3%    13.8%     13.4%    13.7%
ROVER                  13.3%              13.2%

Table 5: MT system performance on eval data

BLEU(c)
Language pair Dev10 Tst12
(MT with true transcript)
En→Fr 40.9
En→De 21.5
(Monolingual translation)
En(pseudo ASR)→En 88.0
En(ASR)→En 69.0
(SLT)
En(ASR)→En→Fr 31.7
En(ASR)→En→De 16.8

The MT system was tested on the IWSLT 2010 development data and the 2012 evaluation data, and the results are shown in Table 5. Performance is shown in terms of cased and punctuated BLEU scores. When given the reference transcript, the MT system gave BLEU scores of 40.9 and 21.5 for the En→Fr and En→De MT tasks respectively. The monolingual translation system (Section 4) restored case and punctuation information. It was tested on pseudo ASR and real ASR output and yielded BLEU scores of 88.0 and 69.0 respectively. Finally, in the SLT setting, the decoded ASR result was fed to the monolingual translation system and the output was subsequently translated. The BLEU scores are 31.7 and 16.8 for the En→Fr and En→De SLT tasks respectively.

In Table 6, the official IWSLT 2014 evaluation performance of the USFD primary system is shown in terms of BLEU and TER (cased and punctuated, and non-cased and non-punctuated).

Table 6: Primary SLT system performance (Tst14)

Language pair   BLEU(c)   TER(c)   BLEU    TER
En→Fr           23.45     59.94    24.14   58.97
En→De           14.75     70.15    15.24   69.15

Figure 2: System integration with ASR and MT


6. System integration 

The USFD primary system is a pipeline SLT system in which the 1-best ASR result was directly fed to the MT system. System integration experiments were carried out in the En→Fr SLT task and the results were submitted as contrastive systems. Figure 2 depicts the integrated system and its comparison with the pipeline system. In the integrated system, ASR system hypotheses are expanded in the form of lattices, confusion networks or N-best lists. A quality estimation (QE) module evaluated and rescored the ASR outputs before they were fed to the MT system.

In our implementation, 10-best outputs from the ASR system on the IWSLT 2011 evaluation data were used for QE training. The QE module derived 117 QuEst [?] features from each sentence to describe its linguistic and statistical properties as well as the statistics from the ASR and MT models. Out of the 117 features, the top 58 were selected using a Gaussian Process (GP) with an RBF kernel as described in [?]. Further, a GP was used to learn the relationship between the selected features and the translation performance of the sentence (in this case, the sentence-level METEOR score) [24]. During testing, the estimated translation performance was used to rescore the 10-best ASR output. Details of the integrated system are described in [25].


Table 7: Contrastive SLT system performance (En→Fr)

Setting                                                Tst12   Tst14
Contrastive 1 (baseline)                               31.33   23.18
Contrastive 2 (+ 10-best list rescoring)               31.51   23.27
Contrastive 3 (+ ASR confidence-informed rescoring)    31.87   23.44


The ROVER combination of the ASR1 and ASR2 systems only provided 1-best output. In the integration experiment, the 10-best output from ASR1 was used instead.

Performance of the contrastive systems in terms of cased and punctuated BLEU score is shown in Table 7. The Contrastive 1 result is from the baseline system with the pipeline setting. Contrastive 2 and 3 show the results of two different system integration settings. The baseline system gave BLEU scores of 31.33 and 23.18 on the IWSLT 2012 and IWSLT 2014 data. The baseline numbers are inferior to the primary system numbers (IWSLT 2012: 31.7; IWSLT 2014: 23.45) shown in Tables 5 and 6. This is because the baseline here did not benefit from ASR system combination.

Rescoring gives 0.18 and 0.09 BLEU improvements on the IWSLT 2012 and IWSLT 2014 data respectively. On inspecting the results, it was found that rescoring was generally more effective for sentences with low ASR confidence. Therefore, a confidence threshold was set, and rescoring was only performed when the ASR confidence dropped below this threshold. For the IWSLT 2012 data, the optimum was reached when 55% of the sentences were selected for rescoring by this confidence criterion, resulting in a further 0.36 BLEU gain. When this threshold was applied to the IWSLT 2014 data, a 0.17 BLEU gain was observed.
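The confidence-informed rescoring rule can be sketched as follows. This is an illustration under stated assumptions rather than the submitted system: `predict_quality` is a hypothetical stand-in for the trained quality estimation model, the hypotheses, confidences and scores are invented toy values, and the threshold is derived from a target fraction of utterances in the way the 55% setting suggests.

```python
def threshold_for_fraction(confidences, fraction):
    """Confidence value below which roughly `fraction` of utterances fall."""
    ranked = sorted(confidences)
    cut = int(len(ranked) * fraction)
    return ranked[cut] if cut < len(ranked) else float("inf")


def choose_hypothesis(nbest, confidence, threshold, predict_quality):
    """Pick the ASR hypothesis to pass to MT for one utterance.

    nbest: ASR hypotheses, best first.
    confidence: ASR confidence of the utterance.
    predict_quality: hypothesis -> predicted translation quality (assumed).
    """
    if confidence >= threshold:
        return nbest[0]  # confident enough: keep the ASR 1-best
    return max(nbest, key=predict_quality)  # otherwise rescore the list


# Toy values, for illustration only.
confidences = [0.95, 0.40, 0.80, 0.55]
threshold = threshold_for_fraction(confidences, 0.5)

nbest = ["i scream for ice cream", "ice cream for ice cream"]
quality = {"i scream for ice cream": 0.27, "ice cream for ice cream": 0.31}

low_conf_choice = choose_hypothesis(nbest, 0.40, threshold, quality.get)
high_conf_choice = choose_hypothesis(nbest, 0.95, threshold, quality.get)
```

In this toy example the low-confidence utterance is rescored and the second hypothesis is chosen, while the high-confidence utterance keeps its 1-best output.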

7. Summary 

In this paper, the USFD SLT system for IWSLT 2014 was described. Automatic speech recognition (ASR) is achieved by two multi-pass deep neural network systems with slightly different tandem configurations and different training data. Machine translation (MT) is achieved by a monolingual phrase-based monotonic translation system which recovers case and inserts punctuation, followed by a bilingual phrase-based translation system. The USFD contrastive systems explore the integration of ASR and MT by using a quality estimation system to rescore the ASR outputs, optimising towards better translation. This gives BLEU improvements of 0.54 and 0.26 on the IWSLT 2012 and 2014 evaluation data respectively.